Week 9: Text Analysis
Regular expressions provide a concise and flexible way to define patterns in strings.
At their most basic level, they can be used to match a fixed string, allowing the pattern to appear anywhere from one to multiple times within a single string.
str_view()
We will use stringr::str_view() to demonstrate various regular expression syntax.
The str_view() function highlights matching patterns by enclosing them in <>.
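A minimal sketch of how this looks (the fruit vector here is an invented example):

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

# str_view() highlights each match; "an" matches twice in "banana"
str_view(fruit, "an")
```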
Metacharacters have special meaning in regular expressions.
Common Metacharacters (1/3)
- `.`: match any character except `\n`
- `^`: match the starting position within the string
- `$`: match the ending position of the string
- `|`: match the expression before or the expression after the operator

Quantifiers control how many times a pattern matches.
Common Metacharacters (2/3)
- `*`: match the preceding element zero or more times
- `{m,n}`: match the preceding element at least \(m\) and not more than \(n\) times
- `?`: match the preceding element zero or one time
- `+`: match the preceding element one or more times

A character set allows you to match any character in a set.
Remember, \ needs to be escaped.
Common Metacharacters (3/3)
- `[]`: match a single character that is contained within the brackets
- `[^]`: match a single character that is not contained within the brackets
- `[a-z]`: match any lower case letter
- `[A-Z]`: match any upper case letter
- `[0-9]`: match any number
- `\d`: match any digit
- `\w`: match any word character (letters and numbers)

Parentheses create capturing groups, enabling you to work with specific subcomponents of a match.
Tip
You can reuse these groups in your pattern, where \1 refers to the match inside the first set of parentheses, \2 refers to the second, and so forth.
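A small illustration of a capturing group and a backreference (the example words are our own):

```r
library(stringr)

# (.) captures any single character; \\1 matches that same character again,
# so the pattern finds doubled letters such as "ff" and "pp"
str_view(c("banana", "coffee", "apple"), "(.)\\1")
```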
`.*` means 0 or more of any character.

Text analysis is a set of techniques that enable data analysts to extract and quantify information stored in the text, whether it’s from messages, tweets, emails, books, or other sources.
For example:
We will use the tidytext package for the first three steps and the gutenbergr package to obtain text data.
tidytext

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.
Let’s start with a conversation from Game of Thrones:
Tidy text format is a table with one-token-per-row.
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis.
Tokenization (unnest_tokens()) is the process of splitting text into tokens.
Use characters as tokens.
N-grams are groups of words defined by \(n\).
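The tokenizations above can be sketched with a toy data frame (the example text is invented for illustration):

```r
library(dplyr)
library(tibble)
library(tidytext)

text_df <- tibble(line = 1:2,
                  text = c("Winter is coming", "The North remembers"))

# one word per row (unnest_tokens lowercases and strips punctuation by default)
text_df %>% unnest_tokens(word, text)

# one character per row
text_df %>% unnest_tokens(char, text, token = "characters")

# bigrams: groups of two consecutive words (n = 2)
text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
```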
The dataset consists of user and critic reviews for Animal Crossing: New Horizons, scraped from Metacritic.
This data was sourced from a #TidyTuesday challenge.
Warning
A value of 0 could indicate missing data!
Long reviews are compressed from the scraping procedure.
We will remove these characters from the text.
Use unnest_tokens() to convert the data into tidy text format.
Note
58% of reviewers write fewer than 75 words, while 36% write more than 150 words.
Most users tend to provide brief feedback, while a smaller group of more engaged reviewers writes longer, more detailed responses.
Note
Certain common words, such as “the” and “a,” don’t contribute much meaning to the text.
In computing, stop words are words that are filtered out before or after processing natural language data (text).
These words are generally among the most common in a language, but there is no universal list of stop words used by all natural language processing tools.
While stop words often do not add meaning to the text, they do contribute to its grammatical structure.
Lexicon: a catalogue of words, like a dictionary or reference word list.
See ?get_stopwords for more info.
It is perfectly acceptable to start with a pre-made word list and remove or append additional words according to your particular use case.
You can replace filter() with an anti_join() call, but filter() makes the action clearer.
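A minimal sketch of both approaches, assuming the SMART stop word lexicon (the `words` tibble is made up):

```r
library(dplyr)
library(tidytext)

stopwords_smart <- get_stopwords(source = "smart")

words <- tibble(word = c("the", "island", "a", "villagers"))

# filter() spells out the condition explicitly...
words %>% filter(!word %in% stopwords_smart$word)

# ...while anti_join() keeps the rows of `words` with no match in the stop word list
words %>% anti_join(stopwords_smart, by = "word")
```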
Note
The most common words are fitting, as the game is a popular Nintendo title for the Switch console, where players can create and play on their own island paradise with animal villagers.
Sentiment analysis is the process of determining the emotional tone or opinion expressed in a piece of text.
It is commonly used to analyze customer feedback, reviews, and social media.
Three widely used general-purpose lexicons for sentiment analysis are AFINN, Bing, and NRC.
All three lexicons are based on unigrams (single words).
Use get_sentiments() to get the Lexicons.
inner_join() returns the rows of reviews whose word can be found in the lexicon.
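A minimal sketch (the `words` tibble is invented for illustration):

```r
library(dplyr)
library(tidytext)

sentiments_bing <- get_sentiments("bing")  # columns: word, sentiment

words <- tibble(word = c("fun", "boring", "island"))

# inner_join() keeps only the words found in the lexicon,
# attaching their positive/negative sentiment label
words %>% inner_join(sentiments_bing, by = "word")
```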
user_reviews_words %>%
inner_join(sentiments_bing) %>%
count(sentiment, word, sort = TRUE) %>%
arrange(desc(n)) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
geom_col() +
coord_flip() +
facet_wrap(~sentiment, scales = "free") +
theme_minimal() +
labs(title = "Sentiments in user reviews", x = "")

The average sentiment per review improves as the grade increases.
Some common words appear in both very positive and very negative reviews, so how do we determine their importance?
How do we measure the importance of a word to a document in a collection of documents?
For example, a novel in a collection of novels, or a review in a set of reviews.
We combine the following statistics:
The raw frequency of a word \(w\) in a document \(d\). It is a function of the word and the document.
\[ tf(w, d) = \frac{\text{count of } w \text{ in } d}{\text{total number of words in } d} \]
The term frequency for each word is the number of times that word occurs divided by the total number of words in the document.
For our reviews a document is a single user’s review. More about that here.
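The term frequency calculation can be sketched on a toy collection (documents and words are made up):

```r
library(dplyr)
library(tibble)

words <- tibble(
  doc  = c("r1", "r1", "r1", "r2"),
  word = c("island", "island", "fun", "fun")
)

# count of each word per document, divided by the document's total word count
words %>%
  count(doc, word) %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%   # e.g. tf("island", r1) = 2/3
  ungroup()
```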
The inverse document frequency tells how common or rare a word is across a collection of documents. It is a function of a word \(w\), and the collection of documents \(\mathcal{D}\).
\[ idf(w, \mathcal{D}) = \log\left(\frac{\text{size of } \mathcal{D}}{\text{number of documents that contain }w}\right) \]
If every document contains \(w\), then \(idf(w, \mathcal{D}) = \log(1) = 0\).
For the reviews data set, our collection is all the reviews. You could compute this in a somewhat roundabout way, as follows:
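One such roundabout computation, on a toy collection of three documents (the data is invented):

```r
library(dplyr)
library(tibble)

docs <- tibble(
  doc  = c(1, 1, 2, 3),
  word = c("island", "fun", "island", "island")
)

n_docs <- n_distinct(docs$doc)

docs %>%
  distinct(doc, word) %>%                     # count each word once per document
  count(word, name = "docs_with_word") %>%
  mutate(idf = log(n_docs / docs_with_word))  # "island" is in every document: idf = log(1) = 0
```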
Multiply tf and idf together. This is a function of a word \(w\), a document \(d\), and the collection of documents \(\mathcal{D}\):
\[ tf\_idf(w, d, \mathcal{D}) = tf(w, d) \times idf(w,\mathcal{D}) \]
A high tf_idf value indicates that a word appears frequently in a specific document but is relatively rare across all documents.
Conversely, a low tf_idf value means the word occurs in many documents, causing the idf to approach zero and resulting in a small tf_idf.
We can use tidytext's bind_tf_idf() to compute those values:
user_reviews_words %>%
anti_join(stopwords_smart) %>%
count(user_name, word, sort = TRUE) %>%
bind_tf_idf(term = word, document = user_name, n = n) %>%
arrange(user_name, desc(tf_idf)) %>%
filter(user_name %in% c("Alucard0", "Cbabybear", "TheRealHighKing")) %>%
group_by(user_name) %>%
top_n(5) %>%
mutate(rank = paste("Top", 1:n())) %>%
ungroup() %>%
mutate(word = interaction(rank, word, lex.order = TRUE, sep = " : ")) %>%
mutate(word = `levels<-`(rev(word), rev(levels(word)))) %>%
ggplot() +
geom_col(aes(word, tf_idf)) +
facet_wrap(~user_name, ncol = 1, scales = "free_y") +
coord_flip()

Text Mining with R has an example comparing historical physics textbooks:
Discourse on Floating Bodies by Galileo Galilei, Treatise on Light by Christiaan Huygens, Experiments with Alternate Currents of High Potential and High Frequency by Nikola Tesla, and Relativity: The Special and General Theory by Albert Einstein. All are available on the Gutenberg project.
Work your way through the comparison of physics books. It is section 3.4.
ETC1010/ETC5510 Lecture 9 | Melbourne time